
[WIP] Pruned-transducer-stateless5-for-WenetSpeech (offline and streaming) #447

Merged

Conversation

luomingshuang (Collaborator)

A pruned-transducer-stateless5 recipe for WenetSpeech.

luomingshuang changed the title from "[WIP] Pruned-transducer-stateless5-for-WenetSpeech" to "[WIP] Pruned-transducer-stateless5-for-WenetSpeech (offline and streaming)" on Jul 3, 2022
luomingshuang (Collaborator, Author) commented Jul 13, 2022

Note: the training is still running; I will update the results as they become available.

Results trained with the M subset:
Offline:

| decoding-method | epoch | avg | use-averaged-model | dev-CER(%) | test-net-CER(%) | test-meeting-CER(%) |
|---|---|---|---|---|---|---|
| greedy_search | 15 | 9 | True | 10.43 | 11.45 | 19.37 |
| modified_beam_search | 15 | 9 | True | 10.03 | 11.13 | 18.36 |
| fast_beam_search | 15 | 9 | True | 10.38 | 11.28 | 19.50 |

Streaming:

| decoding-method | epoch | avg | use-averaged-model | dev-CER(%) | test-net-CER(%) | test-meeting-CER(%) |
|---|---|---|---|---|---|---|
| greedy_search | 15 | 6 | True | 11.23 | 13.12 | 21.34 |
| fast_beam_search | 15 | 6 | True | 11.12 | 12.85 | 21.18 |

Results trained with the L subset:
Offline:

| decoding-method | epoch | avg | use-averaged-model | dev-CER(%) | test-net-CER(%) | test-meeting-CER(%) |
|---|---|---|---|---|---|---|
| greedy_search | 1 | 1 | False | 12.18 | 14.31 | 25.31 |
| greedy_search | 2 | 1 | False | 10.60 | 12.33 | 20.77 |
| greedy_search | 2 | 1 | True | 8.79 | 10.60 | 15.89 |
| greedy_search | 3 | 1 | True | 8.89 | 9.56 | 16.42 |
| greedy_search | 4 | 1 | True | 8.22 | 9.03 | 14.54 |
| modified_beam_search | 4 | 1 | True | 8.17 | 9.04 | 14.44 |
| fast_beam_search | 4 | 1 | True | 8.29 | 9.00 | 14.93 |
| greedy_search | 4 | 2 | True | 8.40 | 9.37 | 14.89 |
| greedy_search | 5 | 1 | True | 8.30 | 8.85 | 14.57 |
| greedy_search | 5 | 2 | True | 8.48 | 9.19 | 14.81 |
| greedy_search | 6 | 1 | True | 9.70 | 9.36 | 17.24 |

Streaming (decode.py):

| decoding-method | epoch | avg | use-averaged-model | dev-CER(%) | test-net-CER(%) | test-meeting-CER(%) |
|---|---|---|---|---|---|---|
| greedy_search | 1 | 1 | False | 14.27 | 17.45 | 30.80 |
| greedy_search | 2 | 1 | False | 12.77 | 15.03 | 25.83 |
| greedy_search | 3 | 1 | False | 12.28 | 13.98 | 25.07 |
| greedy_search | 4 | 1 | False | 11.30 | 12.81 | 21.83 |
| greedy_search | 5 | 1 | False | 11.22 | 12.52 | 21.83 |
| greedy_search | 5 | 1 | True | 8.88 | 10.60 | 16.68 |
| greedy_search | 6 | 1 | True | 8.93 | 10.22 | 16.26 |
| greedy_search | 7 | 1 | True | 8.86 | 9.89 | 15.93 |
| greedy_search | 8 | 1 | True | 9.76 | 11.73 | 17.66 |
| greedy_search | 8 | 2 | True | 12.33 | 14.53 | 21.25 |
| greedy_search | 7 | 2 | True | 17.61 | 19.48 | 30.03 |
| greedy_search | 5 | 2 | True | 12.02 | 14.93 | 22.46 |

Streaming (streaming_decode.py):

| decoding-method | epoch | avg | use-averaged-model | dev-CER(%) | test-net-CER(%) | test-meeting-CER(%) |
|---|---|---|---|---|---|---|
| greedy_search | 5 | 1 | True | 8.79 | 10.80 | 16.82 |
| greedy_search | 5 | 2 | True | 11.90 | 14.93 | 22.36 |
| greedy_search | 6 | 1 | True | 8.79 | 10.47 | 16.34 |
| greedy_search | 7 | 1 | True | 8.78 | 10.12 | 16.16 |
| greedy_search | 8 | 1 | True | 9.64 | 11.77 | 17.57 |
| fast_beam_search | 6 | 1 | True | 8.85 | 10.69 | 16.34 |
| wenet (ctc greedy search - full) | / | / | / | 8.85 | 9.78 | 17.77 |
| wenet (ctc greedy search - 16) | / | / | / | 9.32 | 11.02 | 18.79 |
| wenet (ctc prefix beam search - full) | / | / | / | 8.80 | 9.73 | 17.57 |
| wenet (ctc prefix beam search - 16) | / | / | / | 9.25 | 10.96 | 18.62 |
| wenet (attention rescoring - full) | / | / | / | 8.60 | 9.26 | 17.34 |
| wenet (attention rescoring - 16) | / | / | / | 8.87 | 10.22 | 18.11 |

danpovey (Collaborator) commented Jul 22, 2022

You have a result there with --use-averaged-model True and --avg 1. I believe that option only supports an --avg value greater than one? Or perhaps we changed this at some point, or I am misremembering?

csukuangfj (Collaborator) commented Jul 22, 2022

> You have a result there with --use-averaged-model True and --avg 1. I believe that option only supports an --avg value greater than one? Or perhaps we changed this at some point, or I am misremembering?

It must be a typo in the table, I believe.


Edited: The code does not support --epoch 1 --avg 1 --use-averaged-model 1,
but it supports --epoch X --avg 1 --use-averaged-model 1 where X > 1.
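For context, here is a minimal sketch of what --use-averaged-model does. This is not the exact icefall code; it assumes each epoch-X.pt stores a running weight average under "model_avg" plus a "batch_idx_train" counter, as icefall's train.py does. With --epoch X --avg Y, the start checkpoint is epoch-(X-Y).pt, which is why --epoch 1 --avg 1 would need a nonexistent epoch-0.pt:

```python
import torch

def average_over_epochs(exp_dir: str, epoch: int, avg: int) -> dict:
    """Average model weights over the epoch interval (epoch - avg, epoch]."""
    start = epoch - avg  # --epoch 1 --avg 1 => epoch-0.pt, which never exists
    ckpt_start = torch.load(f"{exp_dir}/epoch-{start}.pt", map_location="cpu")
    ckpt_end = torch.load(f"{exp_dir}/epoch-{epoch}.pt", map_location="cpu")

    n_start = ckpt_start["batch_idx_train"]
    n_end = ckpt_end["batch_idx_train"]

    averaged = {}
    for key, v_end in ckpt_end["model_avg"].items():
        v_start = ckpt_start["model_avg"][key]
        # The difference of two running averages yields the average of the
        # weights seen in the interval (n_start, n_end].
        averaged[key] = (v_end * n_end - v_start * n_start) / (n_end - n_start)
    return averaged
```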

luomingshuang (Collaborator, Author)

Training commands:

Offline:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

python ./pruned_transducer_stateless5/train.py \
  --lang-dir data/lang_char \
  --exp-dir pruned_transducer_stateless5/exp_L_offline \
  --world-size 8 \
  --num-epochs 15 \
  --start-epoch 2 \
  --max-duration 120 \
  --valid-interval 3000 \
  --model-warm-step 3000 \
  --save-every-n 8000 \
  --average-period 1000 \
  --training-subset L

Streaming:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

python ./pruned_transducer_stateless5/train.py \
  --lang-dir data/lang_char \
  --exp-dir pruned_transducer_stateless5/exp_L_streaming \
  --world-size 8 \
  --num-epochs 15 \
  --start-epoch 1 \
  --max-duration 140 \
  --valid-interval 3000 \
  --model-warm-step 3000 \
  --save-every-n 8000 \
  --average-period 1000 \
  --training-subset L \
  --dynamic-chunk-training True \
  --causal-convolution True \
  --short-chunk-size 25 \
  --num-left-chunks 4
```
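The last four flags enable WeNet-style dynamic chunk training: each batch samples a (possibly full-context) chunk size and masks self-attention so every frame sees only its own chunk plus --num-left-chunks chunks of left context. A rough illustrative sketch of the masking idea follows; the exact sampling logic in icefall differs in details:

```python
import random

import torch

def dynamic_chunk_mask(num_frames: int,
                       short_chunk_size: int = 25,
                       num_left_chunks: int = 4) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask; True means attention is allowed."""
    # Train in full-context mode part of the time so the same model also
    # works for offline decoding; otherwise sample a small chunk size.
    if random.random() < 0.5:
        chunk_size = num_frames
    else:
        chunk_size = random.randint(1, short_chunk_size)

    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for t in range(num_frames):
        chunk_idx = t // chunk_size
        start = max(0, (chunk_idx - num_left_chunks) * chunk_size)
        end = min(num_frames, (chunk_idx + 1) * chunk_size)
        mask[t, start:end] = True  # current chunk + limited left context
    return mask
```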

Decoding commands:

Offline:

```bash
export CUDA_VISIBLE_DEVICES='0'

python ./pruned_transducer_stateless5/decode.py \
  --lang-dir data/lang_char \
  --exp-dir pruned_transducer_stateless5/exp_L_offline \
  --use-averaged-model True \
  --max-duration 600 \
  --epoch 4 \
  --avg 1 \
  --decoding-method greedy_search \
  --simulate-streaming 0 \
  --causal-convolution 0 \
  --decode-chunk-size 16 \
  --left-context 64
```

Streaming:

```bash
export CUDA_VISIBLE_DEVICES='0'

python pruned_transducer_stateless5/streaming_decode.py \
  --epoch 6 \
  --avg 1 \
  --decode-chunk-size 16 \
  --left-context 64 \
  --right-context 0 \
  --exp-dir ./pruned_transducer_stateless5/exp_L_streaming \
  --use-averaged-model True \
  --decoding-method greedy_search \
  --num-decode-streams 200
```

luomingshuang (Collaborator, Author) commented Jul 26, 2022

Summary of the best results:

Offline:

| decoding-method | epoch | avg | use-averaged-model | DEV | TEST-NET | TEST-MEETING |
|---|---|---|---|---|---|---|
| greedy_search | 4 | 1 | True | 8.22 | 9.03 | 14.54 |
| modified_beam_search | 4 | 1 | True | 8.17 | 9.04 | 14.44 |
| fast_beam_search | 4 | 1 | True | 8.29 | 9.00 | 14.93 |
| pruned-rnnt2 greedy search | 10 | 2 | / | 7.80 | 8.75 | 13.49 |
| pruned-rnnt2 modified beam search (beam size 4) | 10 | 2 | / | 7.76 | 8.71 | 13.41 |
| pruned-rnnt2 fast beam search (set as default) | 10 | 2 | / | 7.94 | 8.74 | 13.80 |
| wenet (ctc greedy search - full) | / | / | / | 8.85 | 9.78 | 17.77 |
| wenet (ctc prefix beam search - full) | / | / | / | 8.80 | 9.73 | 17.57 |
| wenet (attention rescoring - full) | / | / | / | 8.60 | 9.26 | 17.34 |

Streaming:

| decoding-method | epoch | avg | use-averaged-model | dev-CER(%) | test-net-CER(%) | test-meeting-CER(%) |
|---|---|---|---|---|---|---|
| greedy_search | 7 | 1 | True | 8.78 | 10.12 | 16.16 |
| modified_beam_search | 7 | 1 | True | 8.53 | 9.95 | 15.81 |
| fast_beam_search | 7 | 1 | True | 9.01 | 10.47 | 16.28 |
| wenet (ctc greedy search - 16) | / | / | / | 9.32 | 11.02 | 18.79 |
| wenet (ctc prefix beam search - 16) | / | / | / | 9.25 | 10.96 | 18.62 |
| wenet (attention rescoring - 16) | / | / | / | 8.87 | 10.22 | 18.11 |

luomingshuang (Collaborator, Author)

It seems that pruned rnnt5 does not produce noticeably more deletions at the start of the recognized results than pruned rnnt2.

danpovey (Collaborator)

> It seems that pruned rnnt5 does not produce noticeably more deletions at the start of the recognized results than pruned rnnt2.

Yes, we realized that the issue only affects my setup, because the padding_idx=0 arg to nn.Embedding makes the initial parameters and grads zero for the 0th row of the embedding tensor. So it turns out other optimizers are unaffected.
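A small self-contained demonstration of that nn.Embedding behaviour (plain PyTorch semantics, shown here only to illustrate the point): the row at padding_idx is initialized to zeros and never receives gradient.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
print(emb.weight[0])       # row 0 is initialized to all zeros

loss = emb(torch.tensor([0, 1, 2])).sum()
loss.backward()
print(emb.weight.grad[0])  # all zeros: index 0 never receives gradient
print(emb.weight.grad[1])  # non-zero gradient for a normal row
```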

wgb14 (Contributor) commented Jul 31, 2022

> There are many results; can you or someone else please point to the specific comparison where it is worse? We should investigate if so. I'd like to see whether the training data contains any instances of repeated words that are not spoken; this might explain the repeated words in the transcript.
>
> > Is there any insight why this recipe performed worse than pruned_transducer_stateless2 in batch mode?

Sure, I meant the WER numbers. Copied from https://github.com/k2-fsa/icefall/blob/master/egs/wenetspeech/ASR/RESULTS.md:

Pruned Transducer 5:

Using the code from this PR #447.

When training with the L subset, the CERs are

Offline:

| decoding-method | epoch | avg | use-averaged-model | DEV | TEST-NET | TEST-MEETING |
|---|---|---|---|---|---|---|
| greedy_search | 4 | 1 | True | 8.22 | 9.03 | 14.54 |
| modified_beam_search | 4 | 1 | True | 8.17 | 9.04 | 14.44 |
| fast_beam_search | 4 | 1 | True | 8.29 | 9.00 | 14.93 |

Pruned Transducer 2:

Using the code from PR #349.

When training with the L subset, the CERs are

| decoding method | dev | test-net | test-meeting | comment |
|---|---|---|---|---|
| greedy search | 7.80 | 8.75 | 13.49 | --epoch 10, --avg 2, --max-duration 100 |
| modified beam search (beam size 4) | 7.76 | 8.71 | 13.41 | --epoch 10, --avg 2, --max-duration 100 |
| fast beam search (1best) | 7.94 | 8.74 | 13.80 | --epoch 10, --avg 2, --max-duration 1500 |
| fast beam search (nbest) | 9.82 | 10.98 | 16.37 | --epoch 10, --avg 2, --max-duration 600 |
| fast beam search (nbest oracle) | 6.88 | 7.18 | 11.77 | --epoch 10, --avg 2, --max-duration 600 |
| fast beam search (nbest LG, ngram_lm_scale=0.35) | 8.83 | 9.88 | 15.47 | --epoch 10, --avg 2, --max-duration 600 |

You can see that all 3 decoding methods perform worse. Here I assume offline is batch mode, so this is a fair comparison.

wgb14 (Contributor) commented Jul 31, 2022

I asked because we want to train a streaming model on gigaspeech, and I'm trying to find a proper recipe to start with.

danpovey (Collaborator)

Oh, that is not good. @luomingshuang is there any chance you could show me a tensorboard log that compares the 2 runs?
One possibility is that we have moved to a new sampler that does not globally randomize the samples, and they are in some very nonrandom order that interacts badly with training. For librispeech this seemed to help slightly, but it may not be the same here necessarily. It could be other things though, we need to investigate.

One thing I notice is that with the old model, the best results are obtained at a much later epoch. It could be that something is going wrong with the model averaging. @luomingshuang can you include some results with --avg 1 and --use-averaged-model False, for all the epochs that we have? And please remind me of the locations of the new and old systems and run ~dpovey/bin/analyze.py for all the epochs that you have, so I can have a look and see if anything weird is going on with the norms of the parameter values. It would also be great if you could run with --print-diagnostics=True and save the output, for all the epochs you have, for the new model at least, so I can see if anything weird is visible in those diagnostics.

danpovey (Collaborator)

Also @luomingshuang, for the "Offline" results reported in this PR, were those obtained with the same model as the streaming results? If so, could the fact that this is a streaming model explain the WER increase vs. pruned_transducer_stateless2? (However, it could not explain the degradation with increasing epoch).

pzelasko (Collaborator)

> Oh, that is not good. @luomingshuang is there any chance you could show me a tensorboard log that compares the 2 runs? One possibility is that we have moved to a new sampler that does not globally randomize the samples, and they are in some very nonrandom order that interacts badly with training. For librispeech this seemed to help slightly, but it may not be the same here necessarily. It could be other things though, we need to investigate.

In the case of large datasets like Gigaspeech, I recommend shuffling the manifests with shuf or something similar before starting the training.
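For example, one possible way to pre-shuffle a Lhotse cut manifest on disk (the file names here are hypothetical, the CutSet API may differ slightly between lhotse versions, and shuffling this way loads the manifest into memory):

```python
from lhotse import CutSet

cuts = CutSet.from_file("data/fbank/cuts_L.jsonl.gz")   # hypothetical path
shuffled = cuts.to_eager().shuffle()                    # global shuffle
shuffled.to_file("data/fbank/cuts_L_shuffled.jsonl.gz")
```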

csukuangfj (Collaborator)

> You can see that all 3 decoding methods perform worse. Here I assume offline is batch mode, so this is a fair comparison.

Please pay attention to the number of epochs each model has been trained for.

The worse one was trained for only 4 epochs, while the better one was trained for 10 epochs.

wgb14 (Contributor) commented Jul 31, 2022

> Please pay attention to the number of epochs each model has been trained for.
>
> The worse one was trained for only 4 epochs, while the better one was trained for 10 epochs.

Makes sense. Do you have numbers from the last epoch?

danpovey (Collaborator)

> > Oh, that is not good. @luomingshuang is there any chance you could show me a tensorboard log that compares the 2 runs? One possibility is that we have moved to a new sampler that does not globally randomize the samples, and they are in some very nonrandom order that interacts badly with training. For librispeech this seemed to help slightly, but it may not be the same here necessarily. It could be other things though, we need to investigate.
>
> In the case of large datasets like Gigaspeech, I recommend shuffling the manifests with shuf or something similar before starting the training.

Guys, we should try this at some point, e.g. on full Librispeech or WeNetSpeech. I don't want to change the recipes without testing it, though.

csukuangfj (Collaborator)

Yes, I agree.

luomingshuang (Collaborator, Author) commented Aug 1, 2022

Here is the comparison of tensorboard logs for pruned rnnt2 (part 1 and part 2) and pruned rnnt5: https://tensorboard.dev/experiment/4tBm3f1aSvuzDxyuWeYuLQ/#scalars&_smoothingWeight=0.97

(image: tensorboard training-loss curves comparing the two runs)

It seems that the training loss of pruned rnnt5 might continue to decrease if we keep training.

I compared the performance of pruned rnnt2 and pruned rnnt5 for each epoch as follows (note that the max-durations differ between the pruned rnnt2 and pruned rnnt5 training runs):

| model | decoding-method | epoch | avg | use-averaged-model | dev (CER%) | test-net (CER%) | test-meeting (CER%) |
|---|---|---|---|---|---|---|---|
| pruned_rnnt2 | greedy_search | 0 | 1 | False | 12.24 | 14.52 | 24.24 |
| pruned_rnnt2 | greedy_search | 1 | 1 | False | 11.07 | 12.29 | 23.52 |
| pruned_rnnt2 | greedy_search | 2 | 1 | False | 10.28 | 11.17 | 19.79 |
| pruned_rnnt2 | greedy_search | 3 | 1 | False | 10.16 | 10.82 | 19.31 |
| pruned_rnnt2 | greedy_search | 4 | 1 | False | 9.69 | 10.40 | 18.81 |
| pruned_rnnt2 | greedy_search | 5 | 1 | False | 9.22 | 9.98 | 17.98 |
| pruned_rnnt2 | greedy_search | 6 | 1 | False | 9.34 | 9.84 | 18.35 |
| pruned_rnnt2 | greedy_search | 7 | 1 | False | 9.50 | 9.76 | 17.79 |
| pruned_rnnt2 | greedy_search | 8 | 1 | False | 9.63 | 9.82 | 19.84 |
| pruned_rnnt2 | greedy_search | 9 | 1 | False | 9.10 | 9.32 | 16.63 |
| pruned_rnnt2 | greedy_search | 10 | 1 | False | 9.07 | 9.35 | 17.43 |
| pruned_rnnt5 | greedy_search | 1 | 1 | False | 12.16 | 14.30 | 25.30 |
| pruned_rnnt5 | greedy_search | 2 | 1 | False | 10.59 | 12.33 | 20.73 |
| pruned_rnnt5 | greedy_search | 3 | 1 | False | 10.01 | 11.28 | 19.82 |
| pruned_rnnt5 | greedy_search | 4 | 1 | False | 9.86 | 10.76 | 19.12 |
| pruned_rnnt5 | greedy_search | 5 | 1 | False | 9.98 | 10.62 | 19.94 |
| pruned_rnnt5 | greedy_search | 6 | 1 | False | 9.30 | 9.96 | 17.71 |
| pruned_rnnt5 | greedy_search | 7 | 1 | False | 9.68 | 10.11 | 18.53 |
| pruned_rnnt5 | greedy_search | 8 | 1 | False | 10.28 | 10.00 | 19.94 |
| pruned_rnnt5 | greedy_search | 2 | 1 | True | 8.79 | 10.60 | 15.89 |
| pruned_rnnt5 | greedy_search | 3 | 1 | True | 8.89 | 9.56 | 16.42 |
| pruned_rnnt5 | greedy_search | 4 | 1 | True | 8.22 | 9.03 | 14.54 |
| pruned_rnnt5 | greedy_search | 5 | 1 | True | 8.30 | 8.85 | 14.57 |
| pruned_rnnt5 | greedy_search | 6 | 1 | True | 9.70 | 9.36 | 17.24 |
| pruned_rnnt5 | greedy_search | 7 | 1 | True | 8.07 | 8.79 | 13.73 |
| pruned_rnnt5 | greedy_search | 8 | 1 | True | 7.97 | 8.35 | 13.69 |
| pruned_rnnt5 | greedy_search | 9 | 1 | True | 7.91 | 8.31 | 13.56 |
| pruned_rnnt5 | modified_beam_search | 9 | 1 | True | 7.88 | 8.39 | 13.66 |
| pruned_rnnt5 | fast_beam_search | 9 | 1 | True | 8.03 | 8.31 | 14.07 |
| pruned_rnnt5 | greedy_search | 10 | 1 | True | 7.95 | 8.43 | 13.37 |
| pruned_rnnt5 | modified_beam_search | 10 | 1 | True | 7.98 | 8.56 | 13.58 |
| pruned_rnnt5 | fast_beam_search | 10 | 1 | True | 8.07 | 8.40 | 13.91 |
| pruned_rnnt5 | greedy_search | 10 | 2 | True | 9.52 | 11.05 | 14.77 |
| pruned_rnnt5 | greedy_search | 11 | 1 | True | 8.40 | 8.10 | 14.54 |
| pruned_rnnt5 | greedy_search | 12 | 1 | True | 8.89 | 9.85 | 14.56 |
| pruned_rnnt5 | greedy_search | 13 | 1 | True | 9.45 | 11.08 | 15.25 |

It seems the results with use-averaged-model=True and use-averaged-model=False are very different; for example, pruned_rnnt5 greedy_search at epoch 8 with avg 1 gives 10.28 / 10.00 / 19.94 with use-averaged-model=False, but 7.97 / 8.35 / 13.69 with use-averaged-model=True. Alternatively, it may just be necessary to train for some more epochs. Sorry, I was looking for the best results based on use-averaged-model=False, which is why I thought we got the best results at epoch=4 and avg=1. Now it seems we should look for the best results based on use-averaged-model=True.

Comment on lines +1147 to +1148:

```python
else:
    y = y
```
(Collaborator)

Please remove this else branch. It does not make sense.

danpovey (Collaborator) commented Aug 4, 2022

OK, fine. I suppose you can try more epochs, e.g. train for at least 10 epochs to match the baseline?

luomingshuang (Collaborator, Author)

> OK, fine. I suppose you can try more epochs, e.g. train for at least 10 epochs to match the baseline?

OK, I will try to continue training from the pretrained epoch-x.pt until 10 epochs or more.

luomingshuang (Collaborator, Author) commented Aug 8, 2022

The new results for wenetspeech based on pruned rnnt5, and the comparison between pruned rnnt5 and pruned rnnt2, are as follows (the best results of pruned rnnt5 are now much closer to pruned rnnt2's):

| model | decoding-method | epoch | avg | use-averaged-model | dev (CER%) | test-net (CER%) | test-meeting (CER%) |
|---|---|---|---|---|---|---|---|
| pruned_rnnt5 | greedy_search | 9 | 1 | True | 7.91 | 8.31 | 13.56 |
| pruned_rnnt5 | modified_beam_search | 9 | 1 | True | 7.88 | 8.39 | 13.66 |
| pruned_rnnt5 | fast_beam_search | 9 | 1 | True | 8.03 | 8.31 | 14.07 |
| pruned_rnnt2 | greedy search | 10 | 2 | False | 7.80 | 8.75 | 13.49 |
| pruned_rnnt2 | modified beam search (beam size 4) | 10 | 2 | False | 7.76 | 8.71 | 13.41 |
| pruned_rnnt2 | fast beam search (1best) | 10 | 2 | False | 7.94 | 8.74 | 13.80 |

danpovey (Collaborator) commented Aug 8, 2022

OK; can you report the results with --avg 2 --use-averaged-model False for epoch 10 when you get them? Those should in principle be similar to the pruned_rnnt2 baseline. Is the number of parameters the same?
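For reference, --avg N with --use-averaged-model False corresponds to a plain mean of the model state_dicts of the last N epoch checkpoints. A minimal sketch, assuming each checkpoint stores its weights under a "model" key as icefall's train.py does:

```python
import torch

def average_state_dicts(paths):
    """Plain mean of the "model" state_dicts of the given checkpoints."""
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k in avg:
                avg[k] += sd[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. --epoch 10 --avg 2 --use-averaged-model False:
averaged = average_state_dicts(["exp/epoch-9.pt", "exp/epoch-10.pt"])
```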

luomingshuang (Collaborator, Author)

| model | decoding-method | epoch | avg | use-averaged-model | dev (CER%) | test-net (CER%) | test-meeting (CER%) |
|---|---|---|---|---|---|---|---|
| pruned_rnnt5 | greedy_search | 10 | 2 | True | 9.52 | 11.05 | 14.77 |
| pruned_rnnt5 | greedy_search | 10 | 2 | False | 10.25 | 12.34 | 15.50 |

According to the above table, it seems that the results with avg=2 are worse than those with avg=1.

About the number of model parameters:

| model | number of model parameters |
|---|---|
| pruned rnnt2 | 88978927 |
| pruned rnnt5 | 97487351 |

Is a relatively small and simple neural network better for wenetspeech?

lianrzh commented Sep 29, 2022

@danpovey @luomingshuang Is this model pretrained_epoch_7_avg_1.pt the average of epoch 6 and epoch 7?
I used this model to decode the test datasets and found no way to reproduce the results. The model is in ./pruned_transducer_stateless5/exp_L_streaming.

```bash
ln -s pretrained_epoch_7_avg_1.pt epoch-7.pt

python ./pruned_transducer_stateless5/streaming_decode.py \
  --exp-dir ./pruned_transducer_stateless5/exp_L_streaming \
  --use-averaged-model False \
  --epoch 7 \
  --avg 1 \
  --decode-chunk-size 16 \
  --left-context 64 \
  --right-context 0 \
  --decoding-method greedy_search \
  --num-decode-streams 2000
```

The results of my run are shown below.

```
greedy_search 11.12 best for DEV
greedy_search 12.92 best for TEST_NET
greedy_search 21.86 best for TEST_MEETING
```

Please help me find out which step may be wrong. Thank you.

luomingshuang (Collaborator, Author) commented Sep 29, 2022

pretrained_epoch_7_avg_1.pt contains only epoch 7 (no averaging with epoch 6). Maybe you can try again with the following command:

```bash
export CUDA_VISIBLE_DEVICES='0'

python pruned_transducer_stateless5/streaming_decode.py \
  --epoch 7 \
  --avg 1 \
  --decode-chunk-size 16 \
  --left-context 64 \
  --right-context 0 \
  --exp-dir ./pruned_transducer_stateless5/exp_L_streaming \
  --use-averaged-model True \
  --decoding-method greedy_search \
  --num-decode-streams 200
```

Note: set --use-averaged-model to True.


danpovey (Collaborator)

Thanks @luomingshuang!
I assume that also requires making a link to epoch-6.pt?

csukuangfj (Collaborator)

I will try to find epoch-6.pt.

luomingshuang (Collaborator, Author)

Em... I am not sure now. If someone can test this, that would be good.

csukuangfj (Collaborator)

Here is the exp path on our server (posting it here as a reminder):

```
/ceph-ms/luomingshuang/codes/icefall-wenetspeech-pruned-rnnt5/egs/wenetspeech/ASR/pruned_transducer_stateless5/exp_L_streaming
```


I am creating a PR at

https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming/upload/main/exp

to upload epoch-6.pt.

csukuangfj (Collaborator)

We have uploaded epoch-6.pt.

Please retry.

lianrzh commented Sep 29, 2022

@csukuangfj The size of epoch-6.pt is 1.56G, while epoch-7.pt is 390M.
The two models cannot be averaged; the following error is reported:

```
"icefall/checkpoint.py", line 424, in average_checkpoints_with_averaged_model
    batch_idx_train_end = state_dict_end["batch_idx_train"]
KeyError: 'batch_idx_train'
```

csukuangfj (Collaborator)

Please wait a moment; I will upload epoch-7.pt.

luomingshuang (Collaborator, Author)

The 390M size means that the epoch-7.pt checkpoint file contains only the model parameters.

csukuangfj (Collaborator)

> The size of epoch-6.pt is 1.56G and epoch-7.pt is 390M.

The difference in size is due to the fact that epoch-6.pt is saved by train.py and also contains optimizer.state_dict(), while pretrained_epoch_7_avg_1.pt is exported by export.py and contains only model.state_dict().
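A quick way to check which kind of file you have (the key names follow what icefall's train.py and export.py save; treat them as an assumption if your version differs):

```python
import torch

ckpt = torch.load("epoch-6.pt", map_location="cpu")
print(sorted(ckpt.keys()))
# A train.py checkpoint contains e.g. "model", "model_avg", "optimizer",
# "batch_idx_train", ...; a file exported by export.py contains only the
# model weights.
if "batch_idx_train" not in ckpt:
    print("model-only export: not usable with --use-averaged-model")
```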

csukuangfj (Collaborator)

@lianrzh
I just uploaded epoch-7.pt.

Please retry.

I am uploading epoch-{3,4,5}.pt

csukuangfj (Collaborator)

@lianrzh
We have uploaded epoch-3.pt, epoch-4.pt, epoch-5.pt, epoch-6.pt, and epoch-7.pt.

Please try it again.

csukuangfj added a commit to csukuangfj/icefall that referenced this pull request Nov 14, 2022